library(Seurat)
library(data.table)
library(NMF)
library(rsvd)
library(Rtsne)
library(ggplot2)
library(cowplot)
library(sva)
library(igraph)
library(cccd)
library(KernSmooth)
library(beeswarm)
library(stringr)
library(formatR)
source("tools.R")
library(DESeq2)
Condition message Step 1: Ignore control condition,use all sample Step 2:Under the control condition Positive Negative Tube_1 Tube_2
human.only.pro <- Load_data(data_dir = "./human.txt")
important.genes <- c("ITGB4", "ABCB5", "KRT19", "ACTB", "KRT12", "KRT5", "GAPDH",
"KRT3", "PAX6", "WNT7A", "KRT14", "TP63", "KRT10")
table(unlist(lapply(colnames(human.only.pro), function(x) return(str_split(x,
"_")[[1]][2]))))
##
## 10um 20um 2um 6um
## 326 575 18 180
table(unlist(lapply(colnames(human.only.pro), function(x) return(str_split(x,
"_")[[1]][1]))))
##
## hc001 hc006 hc009 hc012 hc017 hc018 hc020 hc021
## 15 92 187 66 188 188 184 158
## shoutiao
## 21
# only select the cells contain 10 genes expressed at least,select the genes
# must be expressed in two cells at least
human.all.DESeq <- DESeq_SeuratObj(X = human.only.pro, DESq = FALSE, min.cells = 10,
min.genes = 2)
## [1] "Scaling data matrix"
##
|
| | 0%
|
|=================================================================| 100%
all.sample.group <- unlist(lapply(human.all.DESeq@cell.names, function(x) return(str_split(x,
"_")[[1]][1])))
all.sample.size <- unlist(lapply(human.all.DESeq@cell.names, function(x) return(str_split(x,
"_")[[1]][2])))
# reset ident human.all.DESeq<-SetIdent(human.all.DESeq,cells.use =
# human.all.DESeq@cell.names,ident.use = all.sample.size)
Group_Bar(human.all.DESeq@raw.data, group = all.sample.group)
Group_Bar(human.all.DESeq@raw.data, group = all.sample.size)
# We are interested in the gene ITGB4
GenePlot(human.all.DESeq, gene1 = "ITGB4", gene2 = important.genes[2])
# VlnPlot(human.all.DESeq,features.plot = 'ITGB4',y.lab.rot = 90) # Violinn
# plot of gene ITGB in all sample
VlnPlot(human.all.DESeq, features.plot = important.genes[important.genes %in%
rownames(human.all.DESeq@raw.data)], y.lab.rot = 90) # Violinn plot of gene ITGB in all sample
According to the plot of explore data,we can know that the gene ITGB4, ABCB5, KRT19, ACTB, KRT12, KRT5, GAPDH, KRT3, PAX6, WNT7A, KRT14, TP63, KRT10 are expressed differently across the sample group.It is fit to what we are interested.And,across the sample(Barplot),there are many genes that actually expressed different.The barplot of sample group tells us that the gene expression level in hc001 is almost 0.So we consider remove the sample group hc001. Next,based on the explore plot,we will analysis data deeply with other method
## [1] "Scaling data matrix"
##
|
| | 0%
|
|=================================================================| 100%
Here,do the dimensionality reduction using the PCA, tSNE method
all.pbmc <- PCA.TSNE(object = human.all.DESeq, pcs.compute = FALSE, num.pcs = 28)
# FeaturePlot(object = all.pbmc,features.plot ='ITGB4',pt.size = 4,no.legend
# = FALSE) # ITGB4 gene in part dataset
FeaturePlot(object = all.pbmc, features.plot = important.genes[important.genes %in%
rownames(human.all.DESeq@raw.data)], pt.size = 1, no.legend = FALSE, reduction.use = "pca") # ITGB4 gene in part dataset
DimPlot(all.pbmc, reduction.use = "tsne", pt.size = 4) # grour by sample
DimPlot(all.pbmc, reduction.use = "pca", pt.size = 4) # grour by sample
DimHeatmap(all.pbmc, reduction.type = "pca", check.plot = FALSE)
FeatureHeatmap(all.pbmc, features.plot = "ITGB4", pt.size = 3, plot.horiz = TRUE,
cols.use = c("lightgrey", "blue"))
The Faetureplot of ITGB4, ABCB5, KRT19, ACTB, KRT12, KRT5, GAPDH, KRT3, PAX6, WNT7A, KRT14, TP63, KRT10based on PCA shows that,they only has high expression level in few samples,and expresss lowly in most sample.It means that may be these important genes express differently across sample.The plot also tell us the gene KRT5,GAPDH,PAXX6,KRT14 have more higher expression level than the other important genes.It is consistent with the result of violin plot. About the heatmap,we only show the gene ITGB4 And the FeatureHeatmap and Heamap also comfirm this phenomeno.We try the other four variable genes,which has the similar result as gene ITGB4 But the tSNE and * PCA * plot show that, the sample can not be split apparently.The result may be is not good based on the PCA and tSNE method.
Next,we will have analysis on gene differential expression.Find maker genes across sample.We use the method: **wilcox test**
# Finds markers (differentially expressed genes) for each of the identity
# classes in a dataset
all_markers <- FindAllMarkers(all.pbmc, test.use = "bimod", print.bar = FALSE)
head(all_markers)
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster
## MTRNR2L1 1.360725e-131 -2.2312801 0.783 0.509 2.792752e-127 hc006
## AC009501.4 2.734415e-117 -1.7443481 0.717 0.266 5.612113e-113 hc006
## MTATP6P1 5.006631e-73 -2.1743262 1.000 0.907 1.027561e-68 hc006
## BEST1 1.157891e-69 0.8045069 0.859 0.049 2.376456e-65 hc006
## PEX26 2.194607e-65 -0.5001337 0.522 0.084 4.504211e-61 hc006
## ZNF664 1.789094e-64 0.4046469 0.935 0.123 3.671936e-60 hc006
## gene
## MTRNR2L1 MTRNR2L1
## AC009501.4 AC009501.4
## MTATP6P1 MTATP6P1
## BEST1 BEST1
## PEX26 PEX26
## ZNF664 ZNF664
We check whether the important genes are still in the marker genes we found from the DESeq analysis. the genes:ITGB4, KRT19, ACTB, KRT5, GAPDH, KRT3, PAX6, KRT14, TP63, KRT10 are still in the marker genes.
human.heatmap <- Heatmap_fun(genes = important.genes[important.genes %in% rownames(human.all.DESeq@raw.data)],
tpm.data = all.pbmc@scale.data, condition = unique(as.character(all.pbmc@ident)),
all.condition = as.character(all.pbmc@ident))
## There ara 8 conditions
## Whether creat data accurate 0
NMF::aheatmap(human.heatmap[[2]], Rowv = NA, Colv = NA, annCol = human.heatmap[[1]],
scale = "none")
We have find all marker genes across sample,there are 5065 significant genes(adjust p-value <0.05) in all marker genes.
all.pbmc <- KClustDimension(all.pbmc, reduction.use = "pca", k.use = 9)
clusters.pca <- all.pbmc@meta.data$kdimension.ident
DimPlot(all.pbmc, pt.size = 4, group.by = "kdimension.ident")
all.pbmc <- KClustDimension(all.pbmc, reduction.use = "tsne", k.use = 9)
clusters.tsne <- all.pbmc@meta.data$kdimension.ident
DimPlot(all.pbmc, pt.size = 4, group.by = "kdimension.ident", reduction.use = "tsne")
When use the DESeq,it must require the gene count matrix satisify that: every gene contains at least one zero, cannot compute log geometric means. So have to take another method to handle data,but I do not know whether it is reasonable.Just try!!!
# sample group
condition.1 <- unlist(lapply(all.pbmc@cell.names, function(x) return(str_split(x,
"_")[[1]][1])))
# condition.2<-unlist(lapply(colnames(human.only.pro),function(x)return(str_split(x,'_')[[1]][2])))
load("Human.sample.RData")
plotDispEsts(xdds, main = "Per-gene Dispersion")
res <- results(xdds, contrast = c("condition.1", "hc009", "hc006"))
res[which(res$padj < 0.05), ]
## log2 fold change (MLE): condition.1 hc009 vs hc006
## Wald test p-value: condition.1 hc009 vs hc006
## DataFrame with 12296 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## A2M 2.289971 2.206884 0.2321168 9.507642 1.950334e-21
## A2ML1 1.262714 -0.798983 0.1739448 -4.593313 4.362640e-06
## A4GALT 2.169455 -1.159624 0.1904962 -6.087384 1.147705e-09
## AAAS 2.185565 -1.773275 0.1911290 -9.277899 1.728531e-20
## AAGAB 7.465588 -2.233763 0.2809110 -7.951854 1.837409e-15
## ... ... ... ... ... ...
## ZXDC 2.039650 -1.5112369 0.1839614 -8.214967 2.122222e-16
## ZYX 10.361767 -3.4962954 0.2837317 -12.322539 6.850287e-35
## ZZZ3 4.549581 0.9134068 0.2591173 3.525070 4.233704e-04
## snoU13 3.382409 -0.8434911 0.2193144 -3.846036 1.200443e-04
## uc_338 1.142337 -0.4314247 0.1669928 -2.583493 9.780544e-03
## padj
## <numeric>
## A2M 1.553356e-20
## A2ML1 1.049662e-05
## A4GALT 3.725465e-09
## AAAS 1.292053e-19
## AAGAB 9.438512e-15
## ... ...
## ZXDC 1.179148e-15
## ZYX 1.281859e-33
## ZZZ3 8.481449e-04
## snoU13 2.545476e-04
## uc_338 1.660705e-02
the table are the result which gene’s padj value<0.05.Include the variable:baseMean,log2FoldChange lfcSE,stat,pvalue,padj.
library(VennDiagram)
grid.draw(venn.diagram(x[1:4], filename = NULL, fill = c("dodgerblue", "goldenrod1",
"darkorange1", "darkorchid1")))
Venn diagram show that the common differentially expressing genes across gorups.The number represents the the number of common genes.Here just show the randomly selected groups